Bucket Joins

Bucket map join is a special type of map join applied on the bucket tables. To enable bucket map join, we need to enable the following settings.
  • SET hive.auto.convert.join=true; --default false
  • SET hive.optimize.bucketmapjoin=true; --default false 
In bucket map join, all the join tables must be bucket tables and join on buckets columns. In addition, the buckets number in bigger tables must be a multiple of the bucket number in the small tables.
left_semi_join
+--------------+
¦ Id ¦  Data   ¦
+----+---------¦
¦  1 ¦ DataA11 ¦
¦  1 ¦ DataA12 ¦
¦  1 ¦ DataA13 ¦
¦  2 ¦ DataA21 ¦
+--------------+
CREATE TABLE mytable (
         name string,
         city string,
         employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS

then hive will store data in a directory hierarchy like

/user/hive/warehouse/mytable/y=2015/m=12/d=02

So, you have to be careful when partitioning, because if you for instance partition by employee_id and you have millions of employees, you'll end up having millions of directories in your file system. The term 'cardinality' refers to the number of possible value a field can have. For instance, if you have a 'country' field, the countries in the world are about 300, so cardinality would be ~300. For a field like 'timestamp_ms', which changes every millisecond, cardinality can be billions. In general, when choosing a field for partitioning, it should not have a high cardinality, because you'll end up with way too many directories in your file system.

No comments:

Post a Comment